Abstract

In this chapter, we develop a Bayesian Pairwise Classifier (BPC) framework that is suitable for pattern
recognition problems involving a moderately large number of classes, and apply it to two character
recognition datasets. A C-class pattern recognition problem (e.g. C = 26 for recognition of the English
alphabet) is divided into a set of two-class problems. For each pair of classes, a Bayesian classifier
based on a mixture of Gaussians (MOG) is used to model the probability density functions conditioned
on a single feature. A forward feature selection algorithm is then used to grow the feature space, and
an efficient technique is developed to obtain a MOG in the larger feature space from the MOGs in
the smaller spaces. Apart from improvements in classification accuracy, the proposed architecture also
provides valuable domain knowledge, such as which features are most important in separating
a pair of characters, the relative distance between any two characters, etc.


1  Introduction 

There are two phases in a typical pattern recognition problem: the learning/training phase and the
generalization phase. In the learning phase, a predictor or classifier is designed from already labeled training
examples. In the generalization phase, a novel example is assigned a class label by the trained classifier.
The ability of the classifier to generalize to novel examples not seen during training is central to pattern
recognition. It is typically measured in terms of the empirical generalization accuracy, defined as the fraction
of novel examples (test examples) that are assigned the right class label by the trained classifier.

Depending on the domain of application, the raw input could be a set of observed properties (e.g.
symptoms of a disease), a one-dimensional signal (e.g. voice recognition, text recognition, etc.), or even an
image (e.g. face recognition, character recognition, etc.). It is neither feasible nor practical to learn a mapping
from such complex input spaces to class labels. Hence, a preprocessing stage involving data conditioning
followed by feature extraction is used to transform the raw sensory input into a small set of features that the
classifier can operate on. This is all the more true for character recognition problems, where the input is an
image of handwritten characters. Thus, the two-stage learning process can be expressed in terms of a pair
of mappings:

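$$x \in \mathcal{I} \;\xrightarrow{\;\psi\;}\; y \in \mathcal{F} \;\xrightarrow{\;\phi\;}\; \omega \in \Omega \qquad (1)$$

where the first mapping $\psi : \mathcal{I} \to \mathcal{F}$, called the feature extractor, transforms an input vector $x$ in the input
space $\mathcal{I}$ into a feature vector $y$ in some feature space $\mathcal{F}$, and the second mapping $\phi : \mathcal{F} \to \Omega$, called the
classifier, assigns a class label $\omega \in \Omega = \{\omega_1, \omega_2, \ldots, \omega_C\}$ to the feature vector $y$. For example, in the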
first  character  recognition  problem  that  is  considered  in  this  chapter,  the  input  images  of  characters  are 
transformed  into  a  set  of  16  properties  such  as  mean  positions  of  on  pixels,  their  variance,  and  mean  edge 
count  from  left  to  right  and  bottom  to  top,  etc.  In  the  second  problem,  30  tangent  vectors  are  computed 
from  each  character  image.  Although  domain  knowledge  is  used  to  extract  these  features  and  reduce  the 
16384  dimensional  (a  128  x  128  character  image)  input  space  to  a  30  dimensional  feature  space,  it  is  not 
necessary  that  all  the  extracted  features  will  be  actually  useful  in  classification.  Hence  a  smaller  set  of  these 
1.2 Probabilistic learning framework

Once a suitable feature space $\mathcal{F}$ is obtained by feature extraction/selection methods, one has to discriminate
among different classes in the feature space $\mathcal{F}$. The input to the classifier mapping $\phi : \mathcal{F} \to \Omega$ is a feature
vector $y = \psi(x)$ for any input $x \in \mathcal{I}$. The probabilistic learning framework, popular in the statistical pattern
recognition community, is used for modeling $\phi$ in this work. There are a number of classifiers with different
properties that have evolved from this framework.

In the probabilistic learning framework [1,6], input patterns and class labels are assumed to be stochastically
independent and identically distributed random variables $X$ and $\Omega$ respectively. Since feature vectors
are obtained from the input patterns, they are also assumed to be random variables $Y$. In the following
description of the probabilistic learning framework, only the variables $Y$ and $\Omega$ are used since, once $\psi$ is fixed,
for every $X$ there is a corresponding $Y$. The salient features of the probabilistic learning framework are as
follows:


• $Y$ and $\Omega$ are sampled from an unknown joint probability density function $p_{Y,\Omega}(y, \omega)$.

• Input patterns belong to one of the $C$ classes, with the prior probability of a sample being in class $\omega_c$
given by $P(\Omega = \omega_c) = P(\omega_c)$. The priors are constrained by $\sum_{c=1}^{C} P(\omega_c) = 1$.

• The overall probability density function $p(y)$ is a mixture of $C$ class conditional probability density
functions computed in the feature space, i.e. $\{p(Y = y \mid \Omega = \omega_c) = p(y \mid \omega_c)\}_{c=1}^{C}$:

$$p(y) = \sum_{c=1}^{C} P(\omega_c)\, p(y \mid \omega_c) \qquad (4)$$


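• The a posteriori probability $P(\Omega = \omega_c \mid Y = y) = P(\omega_c \mid y)$ of pattern $y$ belonging to class $\omega_c$ is given by
the Bayes rule:

$$P(\omega_c \mid y) = \frac{P(\omega_c)\, p(y \mid \omega_c)}{p(y)} \qquad (5)$$

where the denominator (see Equation 4) is a normalizing factor such that $\sum_{c=1}^{C} P(\omega_c \mid y) = 1$.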


• The classifier $\phi(y)$ tries to estimate these posterior probabilities $\{P(\omega_c \mid y)\}_{c=1}^{C}$. Using these estimates,
it assigns the class label $\omega(y)$ based on the maximum a posteriori probability (MAP) rule,

$$\phi(y) = \omega(y) = \arg\max_{c = 1 \ldots C} P(\omega_c \mid y). \qquad (6)$$

The misclassification error for the MAP rule is given by

$$E_{\mathrm{MAP}}(\phi) = \int \Big(1 - \max_{c = 1 \ldots C} P(\omega_c \mid y)\Big)\, p(y)\, dy. \qquad (7)$$


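• A training set $\mathcal{X} = \{\mathcal{X}_c\}_{c=1}^{C} \subset \mathcal{I}$, where $\mathcal{X}_c$ is the set of training inputs in class $\omega_c$, is available for
supervised learning of $\phi$. After feature extraction, the corresponding training data is denoted by $\mathcal{Y} =
\{\mathcal{Y}_c\}_{c=1}^{C} \subset \mathcal{F}$, where for each sample $x \in \mathcal{X}$ there is a corresponding $y \in \mathcal{Y}$ such that $y = \psi(x)$.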
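The Bayes rule (5) and the MAP rule (6) listed above can be illustrated with a minimal sketch. The function names and the three-class numbers are hypothetical and chosen only for illustration; the class-conditional densities are assumed to have been evaluated already at a single feature vector y.

```python
def posteriors(priors, likelihoods):
    """Bayes rule (5): P(w_c | y) = P(w_c) p(y | w_c) / p(y)."""
    joint = [P * p for P, p in zip(priors, likelihoods)]
    evidence = sum(joint)  # p(y), the mixture in Equation (4)
    if evidence == 0.0:
        raise ValueError("p(y) is zero; the posteriors are undefined")
    return [j / evidence for j in joint]


def map_label(priors, likelihoods):
    """MAP rule (6): index of the class with the largest posterior."""
    post = posteriors(priors, likelihoods)
    return max(range(len(post)), key=post.__getitem__)


# Hypothetical three-class example: priors P(w_c) and densities p(y | w_c) at one y.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.10, 0.40, 0.30]
post = posteriors(priors, likelihoods)
label = map_label(priors, likelihoods)
```

Because the evidence p(y) is common to all classes, the MAP decision depends only on the products of prior and class-conditional density; the division matters only when calibrated posteriors are needed.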


1.3 Classifier Taxonomy

There are two broad categories into which most classifier architectures can be divided: DENSITY BASED
and REGRESSION BASED [7].

1. Density based classifiers estimate the class conditional probability density functions $\{p(y \mid \omega_c)\}_{c=1}^{C}$
and use these to compute the a posteriori probabilities using the Bayes rule (5). Once the estimated
a posteriori probabilities are available, the MAP rule (6) can be used to assign $y$ a class label $\omega(y)$. In
2.1 Pairwise Classifier Architecture


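Figure 1 shows the BPC framework. Each classifier $\phi_{ij}$ has an associated feature extractor denoted by
$\psi_{ij} : \mathcal{I} \to \mathcal{F}_{ij}$ that transforms an input $x \in \mathcal{I}$ into a feature vector $\psi_{ij}(x) \in \mathcal{F}_{ij}$. The output of $\phi_{ij}$ is an
estimate of the posterior probability $P_{ij}(\omega_i \mid \psi_{ij}(x)) = 1 - P_{ij}(\omega_j \mid \psi_{ij}(x))$.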

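Each $\phi_{ij}$ is implemented as a Bayesian classifier that uses two mixtures of Gaussians (MOG), one for class
$\omega_i$ and one for class $\omega_j$, to model the probability density functions $p(\psi_{ij}(x) \mid \omega_k)$, $k = i, j$:

$$p(\psi_{ij}(x) \mid \omega_k) = \sum_{a=1}^{n_{ij}^{k}} G\big(\psi_{ij}(x);\, \mu_{ij}^{k,a},\, \Sigma_{ij}^{k,a}\big), \qquad (10)$$

where $n_{ij}^{k}$ is the number of Gaussians in the mixture for class $\omega_k$, and $\mu_{ij}^{k,a}$ ($\in \mathcal{F}_{ij}$) and $\Sigma_{ij}^{k,a}$ are the
mean vector and covariance matrix of the $a$-th Gaussian in the mixture of class $\omega_k$ for the classifier $\phi_{ij}$. The
Gaussian function $G$ is given by: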


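$$G(y; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^{d}\, |\Sigma|}} \exp\left(-\frac{1}{2} (y - \mu)^{T} \Sigma^{-1} (y - \mu)\right) \qquad (11)$$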


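where $y = \psi_{ij}(x)$ and $d = |\mathcal{F}_{ij}|$ is the dimensionality of the feature space $\mathcal{F}_{ij}$. Bayes rule is used to compute
the classifier output:

$$P_{ij}(\omega_i \mid \psi_{ij}(x)) = \frac{p(\psi_{ij}(x) \mid \omega_i)\, P_{ij}(\omega_i)}{p(\psi_{ij}(x) \mid \omega_i)\, P_{ij}(\omega_i) + p(\psi_{ij}(x) \mid \omega_j)\, P_{ij}(\omega_j)} \qquad (12)$$

where $P_{ij}(\omega_k)$ are the estimated class priors based on the training data:

$$P_{ij}(\omega_k) = \frac{|\mathcal{X}_k|}{|\mathcal{X}_i| + |\mathcal{X}_j|}, \quad k = i, j. \qquad (13)$$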


The problem of finding the right set of features ($\mathcal{F}_{ij}$) and the set of parameters $\{\mu_{ij}^{k,a}, \Sigma_{ij}^{k,a}\}$,
$\forall a = 1 \ldots n_{ij}^{k}$, $k = i, j$, and that of finding the right number of mixtures $n_{ij}^{k}$, are discussed next.
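As an illustration of Equations (10) to (13), the following sketch computes a pairwise posterior for a one-dimensional feature (d = 1). It is not the authors' implementation: the function names and the model parameters are hypothetical, and equal mixing weights 1/n are assumed because no mixing weights are legible in Equation (10).

```python
import math


def gaussian(y, mean, var):
    """Equation (11) for d = 1: a univariate Gaussian density."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mog_density(y, components):
    """Mixture density for one class; components is a list of (mean, var).

    Assumption: equal mixing weights 1/n, so the density integrates to 1.
    """
    return sum(gaussian(y, m, v) for m, v in components) / len(components)


def pairwise_posterior(y, mog_i, mog_j, n_i, n_j):
    """Equation (12): estimate of P_ij(w_i | y), with priors from Equation (13)."""
    prior_i = n_i / (n_i + n_j)
    prior_j = n_j / (n_i + n_j)
    num = mog_density(y, mog_i) * prior_i
    den = num + mog_density(y, mog_j) * prior_j
    return num / den


# Hypothetical models: class i is bimodal, class j is unimodal.
mog_i = [(-2.0, 1.0), (2.0, 1.0)]
mog_j = [(0.0, 0.5)]
p_i = pairwise_posterior(1.9, mog_i, mog_j, n_i=100, n_j=100)
p_j = 1.0 - p_i  # the two pairwise posteriors sum to one
```

In this toy setting a single Gaussian for class i would be centred near zero and overlap class j heavily, which is the situation the mixture model is meant to handle.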


2.2  Feature  Selection 


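Feature selection is done separately and independently for each pairwise classifier. Let $F = \{1, 2, \ldots, D\}$
denote the index set of all features and $y = \psi_{ij}(x) \in \mathcal{F}_{ij} \subset F$ denote the feature vector corresponding to $x$.
In order to select the most discriminating features for the class pair $(\omega_i, \omega_j)$, a relevance $R(\mathcal{F}_{ij})$ is assigned
to the feature set $\mathcal{F}_{ij}$ based on the log odds of estimated class posteriors over the training set $\mathcal{X}_i \cup \mathcal{X}_j$:

$$R(\mathcal{F}_{ij}) = \frac{1}{|\mathcal{X}_i|} \sum_{x \in \mathcal{X}_i} \ln \frac{P_{ij}(\omega_i \mid \psi_{ij}(x))}{P_{ij}(\omega_j \mid \psi_{ij}(x))} + \frac{1}{|\mathcal{X}_j|} \sum_{x \in \mathcal{X}_j} \ln \frac{P_{ij}(\omega_j \mid \psi_{ij}(x))}{P_{ij}(\omega_i \mid \psi_{ij}(x))} \qquad (14)$$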


Note that the relevance depends on the estimates of the pairwise posteriors $P_{ij}(\omega_k \mid \psi_{ij}(x))$, which in turn depend
both on the feature space $\mathcal{F}_{ij}$ and on the parameters used for modeling the pdf in Equation (10). The algorithm
for feature selection for the class pair $(\omega_i, \omega_j)$ is summarized below:

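1. Initialize $\mathcal{F}_{ij} = \arg\max_{f \in F} R(f)$.

2. Augment the feature set sequentially as follows:

(a) Find the next best feature $f$ to add to $\mathcal{F}_{ij}$:

$$f \leftarrow \arg\max_{f' \in F - \mathcal{F}_{ij}} R(\mathcal{F}_{ij} + f'), \qquad (15)$$

where $(\mathcal{F}_{ij} + f)$ denotes the feature set formed by augmenting the feature set $\mathcal{F}_{ij}$ with feature $f$.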
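The greedy search of steps 1 and 2(a) can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the chapter: each class is modelled by a single diagonal-covariance Gaussian instead of a mixture, the relevance is taken to be the sum of the average log odds over the two training sets, and the stopping rule (stop when relevance no longer improves or a hypothetical `max_features` limit is reached) is assumed. All function names and the toy data are hypothetical.

```python
import math
import random


def fit_diag_gaussian(rows, feats):
    """Per-feature (mean, variance) of one class over the selected features."""
    stats = []
    for f in feats:
        vals = [r[f] for r in rows]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals) + 1e-6
        stats.append((mean, var))
    return stats


def log_density(row, feats, stats):
    """Log of a diagonal Gaussian density restricted to the selected features."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (row[f] - mean) ** 2 / var
        for f, (mean, var) in zip(feats, stats)
    )


def relevance(X_i, X_j, feats):
    """Average log odds of the correct class over both training sets."""
    s_i, s_j = fit_diag_gaussian(X_i, feats), fit_diag_gaussian(X_j, feats)
    log_prior_ratio = math.log(len(X_i) / len(X_j))

    def log_odds(row):  # ln [P(w_i | y) / P(w_j | y)]
        return log_density(row, feats, s_i) - log_density(row, feats, s_j) + log_prior_ratio

    r_i = sum(log_odds(r) for r in X_i) / len(X_i)
    r_j = sum(-log_odds(r) for r in X_j) / len(X_j)
    return r_i + r_j


def forward_select(X_i, X_j, n_features, max_features):
    """Greedy forward selection: add the feature that maximizes relevance."""
    selected, best = [], -math.inf
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        scores = {f: relevance(X_i, X_j, selected + [f]) for f in candidates}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:  # assumed stopping rule: no improvement
            break
        selected.append(f_best)
        best = scores[f_best]
    return selected, best


# Toy data: feature 0 separates the classes strongly, feature 2 weakly, feature 1 not at all.
rng = random.Random(0)
X_i = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
X_j = [[rng.gauss(4.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(1.0, 1.0)] for _ in range(200)]
selected, score = forward_select(X_i, X_j, n_features=3, max_features=3)
```

Working with log densities avoids taking the logarithm of a posterior that has underflowed to zero. The first pass through the loop, with an empty feature set, plays the role of the initialization in step 1.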
Figure 1: Pairwise classifier architecture: $\binom{C}{2}$ pairwise classifiers with respective feature selectors

Figure 2: Some examples of letters in LETTER-I dataset [21]


2.3  Combining  the  pairwise  classifiers 

The outputs of the $\binom{C}{2}$ classifiers can be combined to obtain the final output in two ways: (i) by simple
voting [22], or (ii) by using the MAP rule on an estimate of the overall a posteriori probabilities obtained
from the outputs of the pairwise classifiers [23]. In the voting combination scheme, a count $c(\omega_k \mid x)$ of the
number of the $\binom{C}{2}$ classifiers that labeled $x$ into class $\omega_k$,

$$c(\omega_k \mid x) = \sum_{i<k} I\big(\phi_{ik}(\psi_{ik}(x)) < 0.5\big) + \sum_{i>k} I\big(\phi_{ki}(\psi_{ki}(x)) > 0.5\big), \qquad (25)$$

is used. Here $I(\mathit{bool})$ is the indicator function, which is 1 when the $\mathit{bool}$ argument is true, and 0 otherwise.
The input $x$ is assigned the class label for which the count is maximum, i.e. $\omega(x) = \arg\max_{k = 1 \ldots C} c(\omega_k \mid x)$.
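The voting rule of Equation (25) can be sketched in a few lines. The dictionary layout, the function name, and the example outputs are hypothetical; each entry `outputs[(i, j)]` with i < j is assumed to hold the pairwise estimate of the posterior of class i, and ties in the count are broken in favour of the lowest class index, which is an assumption not stated in the text.

```python
def vote_counts(outputs, n_classes):
    """Equation (25): count, for every class, the pairwise classifiers that chose it.

    outputs[(i, j)] with i < j is the estimate of P_ij(w_i | x).
    """
    counts = [0] * n_classes
    for (i, j), p_i in outputs.items():
        if p_i > 0.5:
            counts[i] += 1  # classifier (i, j) votes for class i
        elif p_i < 0.5:
            counts[j] += 1  # classifier (i, j) votes for class j
    return counts


# Hypothetical outputs of the three pairwise classifiers of a 3-class problem.
outputs = {(0, 1): 0.8, (0, 2): 0.3, (1, 2): 0.1}
counts = vote_counts(outputs, n_classes=3)
label = max(range(3), key=counts.__getitem__)
```

An output of exactly 0.5 contributes no vote, which matches the strict inequalities in Equation (25).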

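In another approach to combining pairwise classifiers, proposed recently [23], the overall posterior
probabilities $p_i = P(\omega_i \mid x)$, $\forall i = 1 \ldots C$, are estimated for some $x$ from the $\binom{C}{2}$ posterior probabilities given by (12)
as follows. Denote $m_{ij} = |\mathcal{X}_i| + |\mathcal{X}_j|$, $r_{ij} = \phi_{ij}(\psi_{ij}(x))$ and $\nu_{ij} = p_i / (p_i + p_j)$. The goal is to find an estimate $\hat{p}_i$ of the
true posteriors $P(\omega_i \mid x)$ such that $\hat{\nu}_{ij}$ is close to $r_{ij}$, $\forall i \neq j$. Since there are $C - 1$ independent parameters but
$\binom{C}{2}$ equations, it is not possible in general to estimate $\hat{p}_i$ so that $\hat{\nu}_{ij} = r_{ij}$ $\forall i \neq j$. Hence only an approximate
solution is sought. The closeness criterion that forms the objective function for finding $\hat{p} = (\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_C)$ is
the weighted KL-distance between $r_{ij}$ and $\hat{\nu}_{ij}$:

$$\ell(\hat{p}) = \sum_{i<j} m_{ij} \left[ r_{ij} \log \frac{r_{ij}}{\hat{\nu}_{ij}} + (1 - r_{ij}) \log \frac{1 - r_{ij}}{1 - \hat{\nu}_{ij}} \right] \qquad (26)$$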
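To make the objective in Equation (26) concrete, the sketch below evaluates it for a candidate posterior vector. The coupled quantity is taken to be p_i / (p_i + p_j), the usual pairwise-coupling definition, which is an assumption here because the definition is not fully legible in the text; the function names, weights, and example values are hypothetical.

```python
import math


def coupled(p, i, j):
    """Pairwise probability implied by the overall posteriors: p_i / (p_i + p_j)."""
    return p[i] / (p[i] + p[j])


def kl_objective(p, r, m):
    """Equation (26): weighted KL distance between r_ij and the implied nu_ij.

    r and m are dictionaries keyed by pairs (i, j) with i < j; all p_i must be > 0.
    """
    total = 0.0
    for (i, j), r_ij in r.items():
        nu = coupled(p, i, j)
        term = 0.0
        if r_ij > 0.0:
            term += r_ij * math.log(r_ij / nu)
        if r_ij < 1.0:
            term += (1.0 - r_ij) * math.log((1.0 - r_ij) / (1.0 - nu))
        total += m[(i, j)] * term
    return total


# Pairwise outputs that are exactly consistent with a hypothetical posterior vector.
p_true = [0.5, 0.3, 0.2]
pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)]
r = {(i, j): coupled(p_true, i, j) for (i, j) in pairs}
m = {(i, j): 200 for (i, j) in pairs}

at_truth = kl_objective(p_true, r, m)
at_uniform = kl_objective([1 / 3, 1 / 3, 1 / 3], r, m)
```

When the pairwise outputs are mutually consistent the objective reaches zero at the generating posteriors; with real classifier outputs no exact solution exists in general, so the minimizer is only the closest fit in the weighted KL sense.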

This  results  in  the  following  algorithm. 


1. Start from an initial guess for $\hat{p}_i$ and evaluate the corresponding $\hat{\nu}_{ij}$ using the definition above.

2. Repeat the following updates for $i = 1, 2, \ldots, C, 1, 2, \ldots$ until convergence:
| Classifier | LETTER-I Train | LETTER-I Test | LETTER-II Train | LETTER-II Test |
|---|---|---|---|---|
| ib-NN | 91.2 (0.31) | 89.9 (0.42) | 91.9 (0.39) | 89.5 (0.44) |
| MLP | 80.2 (0.67) | 79.3 (0.73) | 79.3 (0.71) | 76.2 (0.81) |
| MLC | 84.4 (0.34) | 82.7 (0.49) | 81.4 (0.44) | 79.5 (0.51) |
| BPC(1,V) | 87.2 (0.45) | 85.4 (0.57) | 83.7 (0.52) | 82.1 (0.63) |
| BPC(1,M) | 87.6 (0.39) | 85.3 (0.49) | 84.9 (0.51) | 83.3 (0.59) |
| BPC(n,V) | 88.9 (0.26) | 86.2 (0.33) | 85.6 (0.36) | 83.1 (0.40) |
| BPC(n,M) | 89.5 (0.24) | 87.6 (0.35) | 87.9 (0.39) | 86.3 (0.46) |

Table 1: Average training and test accuracy (standard deviations in parentheses) for multi-layered perceptrons (MLP),
maximum likelihood classifier (MLC), Bayesian pairwise classifier with a single Gaussian using the voting
combination method (BPC(1,V)) and the MAP estimate combination (BPC(1,M)), and Bayesian pairwise
classifier with a mixture of Gaussians using the voting (BPC(n,V)) and MAP estimate (BPC(n,M)) combination.


| Classifier | LETTER-I FEATURES | LETTER-I GAUSSIANS | LETTER-II FEATURES | LETTER-II GAUSSIANS |
|---|---|---|---|---|
| BPC(1) | 11.3 | 1.0 | 13.5 | 1.0 |
| BPC(n) | 8.2 | 1.1 | 9.8 | 1.5 |

Table 2: Number of features used (FEATURES) and number of Gaussians in the mixture of Gaussians pdfs
(GAUSSIANS), averaged over all the $\binom{C}{2}$ pairwise classifiers, for both the BPC(1) and BPC(n) cases.


number of features required. The distribution of the total usage of different features over all the pairwise classifiers
is shown in Figure 3. For both BPC(1) and BPC(n), the distribution of usage of different features looks
quite similar for both datasets. In BPC(n) classifiers, the average number of Gaussians per pairwise
classifier is more than 1, but the number of features required is significantly less than that required by the
BPC(1) classifier (8.2 versus 11.3 on LETTER-I and 9.8 versus 13.5 on LETTER-II; see Table 2). Further, the fact
that some pairwise classifiers required more than one Gaussian per class to model their pdfs shows that the data
sets were not exactly unimodal, and this explains the difference in performance of the BPC(1) and BPC(n) classifiers.


4  Domain  Knowledge  Extraction 

The pairwise architecture with Bayesian classifiers based on mixtures of Gaussians makes it possible to
extract several kinds of domain knowledge from the trained predictors, for example, the features that are useful
for distinguishing between a particular pair of classes, a measure of distance between classes, etc. Domain
knowledge extracted from the character recognition datasets is described in this section.

4.1  Overall  Importance  of  Features 

Figure 3 shows the histogram of the number of times a feature was actually used in the pairwise classifiers for
both the BPC(1) and BPC(n) variants. For the LETTER-I dataset, the least used feature was the vertical position
of the box (feature 2), and the most used feature was the edge count from bottom to top (feature 15). This
kind of domain knowledge could reduce the cost of measuring different properties (features) of the objects,
once it is known which properties (features) are more useful than others for the overall task. Such domain
knowledge is very useful in applications like remote sensing classification problems, where certain types of
sensors are more useful than others for a given application [19].
Table 3: The lower triangular matrix entries denote the number of features used for each pair of classes and
the upper triangular matrix denotes the rounded relevance measure in the corresponding feature space.


number  of  modes  for  a  class  in  a  given  feature  space.  The  BPC  framework  addresses  both  these  problems 
efficiently  by  the  growing  and  pruning  algorithm  described  in  section  2.2.  It  also  highlights  the  fact  that  the 
number  of  modes  in  a  distribution  is  conditioned  on  the  feature  space. 

Consider the scatter plots of class pairs (B/M), (B/W), (D/W), and (H/W) in Figures 4 and 5. In
all these cases a single Gaussian would not have been able to model the desired pdfs. Moreover, a different
number of Gaussians is required, in general, for the two classes within each pairwise classifier. Thus
the flexibility of the BPC architecture, not only in choosing the right features but also in automatically
deciding the right number of Gaussians to model the pdfs for different classes, was found to be useful for
the two datasets. Such flexibility is not available in conventional classifiers. Thus domain knowledge about
the feature space, together with information about the number of modes of each class in the corresponding
feature space, can be extracted from the BPC architecture.

4.4  Distance  between  classes 

In conventional methods, where a single classifier is used for the whole C-class problem, an estimate of
the distance between two classes could be obtained from the "confusion matrix" of the training/validation set.
The higher the number of class $\omega_i$ examples classified into class $\omega_j$ and vice versa, the "closer" the
two classes are. Unfortunately, this kind of estimate is influenced by the type of classifier used, instead of
solely being a property of the domain itself.

The BPC framework provides a classifier independent measure of distance between the class pair $(\omega_i, \omega_j)$ in
terms of the relevance function $R(\mathcal{F}_{ij})$. If $\mathcal{F}_{ij}$ is a feature space in which the discrimination between two
classes is high, then smaller relevance implies that it is harder to distinguish between the classes, which in turn
implies that the two classes are "close" to each other in some sense. Table 3 contains the relevance measures
between all pairs of classes for the LETTER-I dataset. Since relevance is symmetric, only the upper triangular
matrix is used for relevance measures. The lower triangular matrix contains the number of features required
to distinguish between two classes. Relevance measures that are towards the lower and higher end are
boldfaced. The pairs of classes that were found to be "close" to each other in terms of the relevance function
using the LETTER-I dataset were (B/E), (B/R), (O/Q), (K/X), (P/F), etc. Classes that were found to be
"distant" from each other were (H/Z), (L/S), (P/V), (A/L), etc. These results are particularly interesting as
they show how the BPC framework is able to extract expected domain knowledge. The distance information
is also useful, for example, to hierarchically cluster the characters.
Figure 5: Data distribution for LEI
BPC during the letter recognition t.